White wine quality: dataset exploration

Load and preview the data:

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

There are 13 variables in our dataset. The first one, X, is a unique row ID. Then there are 11 independent feature variables and one output variable, quality. Quality is an integer score on a scale from 0 to 10; in our data, it ranges from 3 to 9. Median quality is 6 and the mean is only slightly lower at 5.88.
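The listings above look like the output of names(), summary() and str(). As a minimal sketch of that preview step, the block below rebuilds only the first 10 rows from the values printed by str(), so it runs without the data file. The name wines is an assumption, and density is left out because its listing is truncated.

```r
# First 10 rows transcribed from the str() listing above. The full data set
# has 4898 rows and would normally be read from a file instead.
# "wines" is an assumed name; density is omitted (its str() line is truncated).
wines <- data.frame(
  X = 1:10,
  fixed.acidity = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  citric.acid = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  chlorides = c(0.045, 0.049, 0.05, 0.058, 0.058, 0.05, 0.045, 0.045, 0.049, 0.044),
  free.sulfur.dioxide = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  pH = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  sulphates = c(0.45, 0.49, 0.44, 0.4, 0.4, 0.44, 0.47, 0.45, 0.49, 0.45),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  quality = rep(6L, 10)
)

print(names(wines))    # column names
print(summary(wines))  # min, quartiles, mean and max per column
str(wines)             # column types and leading values
```

On these 10 rows the summary will of course differ from the full-data summary shown above.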

Quality histogram

Let’s see a histogram of wines by quality:

We can see that the largest share of the wines (about 45%) falls into the bin with quality equal to 6, i.e. average quality. The distribution is roughly bell-shaped and slightly right-skewed.
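The shape of this distribution can be checked without the data file. The sketch below rebuilds the quality column from the per-score counts reported in the "Table by quality" section, draws a bar chart with base R (the original plot may well have used another plotting library), and computes a moment-based skewness. The variable names are assumptions.

```r
# Counts per quality score, taken from the "Table by quality" section.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Rebuild the quality column: each score repeated as often as it was observed.
quality <- rep(as.integer(names(counts)), times = counts)

# One bar per discrete score.
barplot(table(quality), xlab = "quality", ylab = "number of wines")

# Moment-based skewness: third central moment divided by variance^1.5.
# A positive value indicates a longer right tail.
m <- mean(quality)
skewness <- mean((quality - m)^3) / mean((quality - m)^2)^1.5

print(c(mean = m, median = median(quality), skewness = skewness))
```

A small positive skewness is consistent with a slight right skew, even though the mean sits just below the median.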

Input variables

Fixed acidity:

Volatile acidity:

Citric acid:

Residual sugar:

Chlorides:

Free sulfur dioxide:

Total sulfur dioxide:

Density:

pH:

Sulphates:

Alcohol:

Most of the variables look roughly bell-shaped, although several have a long right tail. A lot of residual sugar values fall into the same bin between 1 and 2, and there are a few outliers in bin 65-66. Other plots that have distinct outliers are free sulfur dioxide and density.
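The outliers mentioned here can be cross-checked against the summary() output alone. The sketch below applies the common 1.5 x IQR rule (an assumption, not necessarily the criterion used for the plots) to the quartiles and maxima printed earlier, with alcohol included for contrast.

```r
# Quartiles, means and maxima copied from the summary() output above.
stats <- data.frame(
  variable = c("residual.sugar", "free.sulfur.dioxide", "density", "alcohol"),
  q1       = c(1.7, 23, 0.9917, 9.5),
  median   = c(5.2, 34, 0.9937, 10.4),
  mean     = c(6.391, 35.31, 0.9940, 10.51),
  q3       = c(9.9, 46, 0.9961, 11.4),
  max      = c(65.8, 289, 1.0390, 14.2)
)

# Upper fence of the 1.5 * IQR rule; a maximum beyond it flags high outliers.
stats$upper_fence <- stats$q3 + 1.5 * (stats$q3 - stats$q1)
stats$max_beyond_fence <- stats$max > stats$upper_fence

# A mean above the median hints at a right tail.
stats$mean_above_median <- stats$mean > stats$median

print(stats)
```

Under this rule the maxima of residual sugar, free sulfur dioxide and density lie beyond the upper fence, while the maximum of alcohol does not.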

Table by quality

Let’s see a table of wine counts by quality.

## Source: local data frame [7 x 3]
## 
##   quality     n percent
##     (int) (int)   (dbl)
## 1       3    20    0.41
## 2       4   163    3.33
## 3       5  1457   29.75
## 4       6  2198   44.88
## 5       7   880   17.97
## 6       8   175    3.57
## 7       9     5    0.10
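The printed header suggests the table was produced with a data-frame pipeline. As a hedged base R sketch of the same computation, the block below rebuilds the quality column from the counts above (so that it runs without the data file) and derives the count and percentage per score. The names by_quality and counts are assumptions.

```r
# Rebuild the quality column from the counts shown in the table above.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)

# Tabulate the scores, then express each count as a percentage of all wines.
n <- table(quality)
by_quality <- data.frame(
  quality = as.integer(names(n)),
  n       = as.vector(n),
  percent = round(100 * as.vector(n) / length(quality), 2)
)

print(by_quality)
```

With the real data frame, the rebuilt vector would simply be replaced by the quality column.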

Univariate analysis

There are 4898 wines in the white wine dataset. All of them are variants of the Portuguese “Vinho Verde” wine. There are 11 input variables and one output variable (quality). Quality can be converted to a factor variable with levels from 0 to 10. Most of the wines fall into the average quality (6); the grouped table above demonstrates that. All other variables are numeric, mostly concentrations of chemical components in g / dm^3, except alcohol, which is measured in % by volume, and pH, which is measured on a scale of 0 to 14 (solutions with a pH less than 7 are acidic and solutions with a pH greater than 7 are basic). We are interested to know how our input variables affect our output variable.

Explore how quality is affected by other variables

Correlation matrix:

## [1] 4898   13
##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
##                        sulphates     alcohol      quality
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831
## volatile.acidity     -0.03572815  0.06771794 -0.194722969
## citric.acid           0.06233094 -0.07572873 -0.009209091
## residual.sugar       -0.02666437 -0.45063122 -0.097576829
## chlorides             0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218
## density               0.07449315 -0.78013762 -0.307123313
## pH                    0.15595150  0.12143210  0.099427246
## sulphates             1.00000000 -0.01743277  0.053677877
## alcohol              -0.01743277  1.00000000  0.435574715
## quality               0.05367788  0.43557472  1.000000000

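Pairs that show moderate linear correlation (absolute value between 0.3 and 0.7) are: fixed acidity and pH (-0.43), residual sugar and total sulfur dioxide (0.40), residual sugar and alcohol (-0.45), chlorides and alcohol (-0.36), free and total sulfur dioxide (0.62), total sulfur dioxide and density (0.53), total sulfur dioxide and alcohol (-0.45), density and quality (-0.31), alcohol and quality (0.44).

Pairs that show strong linear correlation (absolute value of 0.7 and up) are: residual sugar and density (0.84), density and alcohol (-0.78).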
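Picking such pairs out of a correlation matrix can be sketched as below. To stay self-contained, the example uses only a 4 x 4 excerpt of the matrix above (residual sugar, density, alcohol, quality). The helper pairs_above is hypothetical, and the thresholds are applied to absolute values, which is an assumption about how the cut-offs were meant.

```r
# 4 x 4 excerpt of the correlation matrix printed above.
vars <- c("residual.sugar", "density", "alcohol", "quality")
r <- matrix(c(
   1.00000000,  0.83896645, -0.45063122, -0.09757683,
   0.83896645,  1.00000000, -0.78013762, -0.30712331,
  -0.45063122, -0.78013762,  1.00000000,  0.43557472,
  -0.09757683, -0.30712331,  0.43557472,  1.00000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Hypothetical helper: list each pair once (upper triangle only) whose
# absolute correlation lies in [lo, hi).
pairs_above <- function(r, lo, hi = Inf) {
  keep <- upper.tri(r) & abs(r) >= lo & abs(r) < hi
  idx <- which(keep, arr.ind = TRUE)
  data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    stringsAsFactors = FALSE
  )
}

moderate <- pairs_above(r, 0.3, 0.7)
strong   <- pairs_above(r, 0.7)

print(moderate)
print(strong)
```

Applied to the full matrix returned by cor(), the same helper would cover all variable pairs.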

Correlation plot (ggpairs):

Fixed acidity and volatile acidity:

Acidity seems to vary a lot, but faceting by quality, we can see that higher quality wines have acidities concentrated in a much smaller range. For fixed acidity, it falls into a range between 6 and 9, and for volatile acidity, it’s below 0.5.

Acidity vs pH:

There is a weak linear dependency between the sum of the acidities and pH, which makes sense: a more acidic wine has a lower pH.

Citric acid and residual sugar:

Here again, we see a very wide variation of the two variables, but just like before, for higher quality wines (7, 8, 9) the spread is not quite so big. Citric acid mostly clumps within the range of 0.3-0.45, and residual sugar within the range of 0-12.

Residual sugar vs alcohol:

Free and total sulfur dioxide:

Free SO2 and total SO2 show a roughly linear dependency (correlation 0.62), which is logical, since free SO2 is part of total SO2.

We see that in good wines (blues and greens), free SO2 is not too low (mostly between 10 and 50), while total SO2 is not too high (mostly below 175).

Total sulfur dioxide vs alcohol:

We can see a moderate negative linear dependency (correlation -0.45).

Density and pH:

We are faceting by quality here, and it does not look like density plays any role in quality: the range is pretty much the same on all grid facets. This is curious, because we did see a correlation between density and quality in our matrix (it was only -0.31, but larger in magnitude than many others). As for pH, however, the spread again narrows (3.0-3.6) with better quality wines.

Density vs quality:

And yet, when we plot density vs quality, a weak negative linear relationship presents itself.

Density vs residual sugar:

We can clearly see the strong linear dependency that was pointed out by the correlation coefficient (0.84).

Density vs alcohol:

Here, the correlation coefficient was negative (-0.78), and almost as large in magnitude as for the previous plot (0.84).

We are combining chlorides, sulphates and alcohol here, and faceting by quality again. The alcohol pattern is not very clear, but it seems that wines of the highest quality (9) are rather high in alcohol content (10-13%), and a lot of the lighter wines (8-9% alcohol) fall into quality 3-5. The best wines come with lower chlorides (0.01-0.07) and moderate sulphates (0.1-0.9).

If we plot alcohol content as a histogram and color it by quality, it supports the same observation: higher alcohol content correlates with higher quality.

Final plots and summary

This dataset is quite challenging, because even though the input variables apparently affect the target variable somehow, it’s hard to pinpoint a descriptive function. There is no strong linear, exponential, square root, or other discernible dependency between any single input variable and the output variable. We can, however, draw some conclusions as to the ranges of our input variables which seem to correlate with “better” values of our output variable. I have chosen some of the plots to demonstrate what I mean. The first one is acidity:

While acidity varies a lot within each of the facets, it does not vary quite so much within the facets marked 7, 8 and 9. So we can make an assumption that wines that fall within this narrower range of acidity are more likely to be evaluated as high quality. These ranges are 6-9 for fixed acidity and 0.01-0.5 for volatile acidity.

The next plot is sulfur dioxide, colored by quality. Again, good quality corresponds to a narrower range of the input variable, which is 10-50 for free SO2 and 50-175 for total SO2.

Looking at chlorides, sulphates and alcohol faceted by quality, we can see the link between higher alcohol content (10-13) and higher quality. Chlorides should be kept low (0.01 - 0.07) and sulphates moderate (0.1-0.9).

A few of the variables were strongly correlated. This may mean that if we model quality based on the other variables, we don’t have to include all of them in the model. For example, we could include only total sulfur dioxide and not free sulfur dioxide, or include alcohol but not density, which is highly correlated with alcohol:
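One hedged way to quantify this redundancy is the variance inflation factor (VIF), which for standardized predictors equals the diagonal of the inverse of their correlation matrix. The sketch below uses only the three mutually correlated variables residual sugar, density and alcohol, with coefficients copied from the matrix above. VIFs computed over all 11 inputs would differ, so this is an illustration and not a result from the original analysis.

```r
# Correlations among three inputs, copied from the correlation matrix above.
vars <- c("residual.sugar", "density", "alcohol")
r <- matrix(c(
   1.00000000,  0.83896645, -0.45063122,
   0.83896645,  1.00000000, -0.78013762,
  -0.45063122, -0.78013762,  1.00000000
), nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# VIF of each variable = matching diagonal entry of the inverse correlation
# matrix. A value of 1 means no overlap with the others; large values mean
# the variable is almost a linear combination of the others.
vif <- setNames(diag(solve(r)), vars)

print(round(vif, 2))
print(names(which.max(vif)))  # the most redundant variable in this subset
```

Within this subset, density comes out as the most redundant variable, which fits the idea of keeping alcohol and leaving density out.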

Reflection

In this dataset, we have 11 input variables that potentially affect the output variable. Most of them are roughly bell-shaped, and some of them are correlated. It would be tempting to consolidate some, for example, to roll up fixed acidity, volatile acidity and citric acid into “total.acidity”, and then, because the total is correlated with pH, leave it out of the model entirely. Potentially, we could also drop density, because it’s highly correlated with alcohol, or residual sugar, which is highly correlated with density.

However, I would be reluctant to exclude anything when modeling, so as not to lose important details. For example, from the dataset description it is clear that not all acids are created equal in terms of wine quality: while citric acid adds “flavor and freshness”, acetic acid leads to “vinegar taste”. I think that the only variable that we could somewhat safely drop is free sulfur dioxide, which is part of total sulfur dioxide.

Also, the only variables that show a visible linear correlation with quality are alcohol and density. We already know that density depends on alcohol, so dropping that, we seem to have only alcohol left as a “good” predictor of quality. But we know that there’s much more to wine than alcohol content, otherwise pure alcohol would be declared the best wine.
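How much alcohol and density could explain in a linear model can be estimated from the correlation coefficients alone. For a single predictor, R^2 is the squared correlation. For two predictors, the standard formula is R^2 = (r1^2 + r2^2 - 2 r1 r2 r12) / (1 - r12^2). The sketch below plugs in the coefficients from the correlation matrix; the variable names are assumptions.

```r
# Coefficients copied from the correlation matrix above.
r_alc_q   <-  0.435574715  # alcohol vs quality
r_den_q   <- -0.307123313  # density vs quality
r_alc_den <- -0.78013762   # alcohol vs density

# One predictor: share of the variance in quality explained by alcohol alone.
r2_alcohol <- r_alc_q^2

# Two predictors: R^2 of a linear model on alcohol and density together.
r2_both <- (r_alc_q^2 + r_den_q^2 - 2 * r_alc_q * r_den_q * r_alc_den) /
  (1 - r_alc_den^2)

print(round(c(alcohol = r2_alcohol, alcohol_and_density = r2_both), 4))
```

Alcohol alone accounts for roughly 19% of the variance in quality, and adding density raises that by well under one percentage point, which supports both points: density adds little beyond alcohol, and alcohol by itself leaves most of the variation unexplained.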

Because there is no discernible linear pattern between the input and output variables (or a relationship that can be transformed to linear), unfortunately, I can’t fit a useful linear model to this dataset to predict wine quality from a combination of input variables. But I can make some assumptions as to what values of the input variables would likely be present in good quality wines:

To build a model of quality, a machine learning approach, such as support vector machines or a neural network, may also work.